Abstract

The problem of publishing personal data without giving up privacy is becoming increasingly
important. An interesting formalization recently proposed is k-anonymity. This approach
requires that the rows in a table are clustered in sets of size at least k and that all the
rows in a cluster become the same tuple, after the suppression of some records. The natural
optimization problem, where the goal is to minimize the number of suppressed entries, is
known to be NP-hard when the values are over a ternary alphabet, k = 3 and the row length is
unbounded. In this paper we give a lower bound on the approximation factor that any
polynomial-time algorithm can achieve on two restrictions of the problem, namely (i) when the
record values are over a binary alphabet and k = 3, and (ii) when the records have length at
most 8 and k = 4, showing that these restrictions of the problem are APX-hard.

1 Introduction

In many research fields, for example in epidemic analysis, the analysis of large amounts of
personal data is essential. However, a relevant issue in the management of such data is the
protection of individual privacy. One approach to deal with this problem is the k-anonymity
model [QlllOllH], where a single table is given. The rows of the table represent records
belonging to different individuals. Then some of the entries in the table are suppressed so
that, for each record r in the resulting table, there exist at least k − 1 other records
identical to r. At the end of this process, identical rows can be clustered together; clearly
the resulting data is not sufficient to identify each individual. Different versions of the
problem have also been introduced p], for example allowing the generalization of entry values
(an entry value can be replaced with a less specific value). However, in this paper we will
focus only on the suppression model.

A simple parsimonious principle leads to the optimization problem where the number of
entries in the table to be suppressed (or generalized) has to be minimized. The k-anonymity
problem is known to be NP-hard for rows of unbounded length with values over a ternary
alphabet and k = 3 [2]. Moreover, a polynomial-time O(k)-approximation algorithm on arbitrary
input alphabet, as well as some other approximation algorithms for some restricted cases, are
known [2]. Recently, approximation algorithms with factor O(log k) have been proposed [7],
even for generalized versions of the problem [6].

In this paper, we further investigate the approximation and computational complexity of the
k-anonymity problem, settling the APX-hardness for two interesting restrictions of the
problem: (i) when the matrix entries are over a binary alphabet and k = 3, or (ii) when the
matrix has 8 columns and k = 4. We notice that these are the first inapproximability results
for the k-anonymity problem. More precisely, we prove the two inapproximability results by
designing two L-reductions from the Minimum Vertex Cover problem to the 3-anonymity problem
over a binary alphabet and to the 4-anonymity problem when the rows have length 8,
respectively. These two restrictions are of particular interest, as some data can be
inherently binary (e.g. gender) and publicly disclosed data usually have only a few columns;
therefore solving such restrictions could help in most practical cases.

The rest of the paper is organized as follows. In Section 2 we introduce some preliminary
definitions and we give the formal definition of the k-anonymity problem. In Section 3 we
show that the 3-anonymity problem is APX-hard, even when the matrix is restricted to binary
data, while in Section 4 we show that the 4-anonymity problem is APX-hard, even when the rows
have length bounded by 8.

2 Preliminary Definitions 

In this section we introduce some preliminary definitions that will be used in the rest of the 
paper. A graph G = (V, E) is cubic when each vertex in V has degree three. 

Given an alphabet $\Sigma$, a row $r$ is a vector of elements taken from the set $\Sigma$,
and the $j$-th element of $r$ is denoted by $r[j]$. Let $r_1, r_2$ be two equal-length rows.
Then $H(r_1, r_2)$ is the Hamming distance of $r_1$ and $r_2$, i.e.
$|\{i : r_1[i] \neq r_2[i]\}|$. Let $R$ be a set of $l$ rows, then a clustering of $R$ is a
partition $P = (P_1, \dots, P_t)$ of $R$. Since all rows in a table have the same number of
elements and the order of the elements of a row is important, we may think of a row over the
set $\Sigma$ as a string over alphabet $\Sigma$.

Given a clustering $P = (P_1, \dots, P_t)$ of $R$, we define the cost of a row $r$ belonging
to a set $P_i$ as $|\{j : \exists r_1, r_2 \in P_i,\ r_1[j] \neq r_2[j]\}|$, that is the
number of entries of $r$ that have to be suppressed so that all rows in $P_i$ are identical.
Similarly we define the cost of a set $P_i$, denoted by $c(P_i)$, as
$|P_i| \cdot |\{j : \exists r_1, r_2 \in P_i,\ r_1[j] \neq r_2[j]\}|$. The cost of $P$,
denoted by $c(P)$, is defined as $\sum_{P_i \in P} c(P_i)$. Notice that, given a clustering
$P = (P_1, \dots, P_t)$ of $R$, the quantity $|P_i| \max_{r_1, r_2 \in P_i} \{H(r_1, r_2)\}$
is a lower bound for $c(P_i)$, since all the positions for which $r_1$ and $r_2$ differ will
be deleted in each row of $P_i$. We are now able to formally define the k-Anonymity Problem
(k-AP) as follows:

Problem 1. k-AP.

Input: a set $R$ of rows over an alphabet $\Sigma$.

Output: a clustering $P = (P_1, \dots, P_t)$ of $R$ such that for each set $P_i$, $|P_i| \geq k$.

Goal: to minimize $c(P)$.
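
The cost function of k-AP can be illustrated with a minimal Python sketch. The function names and the six-row binary instance below are made up for illustration and are not part of the paper; the code only evaluates $c(P)$ for a given clustering and does not solve the optimization problem.

```python
from typing import List, Sequence

Row = Sequence[str]

def hamming(r1: Row, r2: Row) -> int:
    """Number of positions where two equal-length rows differ."""
    if len(r1) != len(r2):
        raise ValueError("rows must have the same length")
    return sum(a != b for a, b in zip(r1, r2))

def cluster_cost(cluster: List[Row]) -> int:
    """|P_i| times the number of columns on which the rows of P_i disagree."""
    disagreeing = sum(len(set(col)) > 1 for col in zip(*cluster))
    return len(cluster) * disagreeing

def clustering_cost(clustering: List[List[Row]], k: int) -> int:
    """c(P) for a feasible clustering; raises if some cluster has fewer than k rows."""
    if any(len(c) < k for c in clustering):
        raise ValueError("every cluster needs at least k rows")
    return sum(cluster_cost(c) for c in clustering)

# Hypothetical toy instance: six binary rows of length 4, k = 3.
rows = ["0000", "0001", "0011", "1111", "1110", "1100"]
P = [rows[:3], rows[3:]]
print(clustering_cost(P, 3))  # 12: each cluster suppresses 2 columns in 3 rows
```

In this toy instance each cluster meets the lower bound $|P_i| \max H(r_1, r_2) = 3 \cdot 2$ with equality; in general the cost can be strictly larger than the bound.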

The following Proposition will be used in several proofs.

Proposition 1. Let $R$ be an instance of k-AP, and let $P$ be a solution of k-AP over
instance $R$. Then we can compute in polynomial time a solution $P'$, with $c(P') \leq c(P)$,
such that for each cluster $P'_i$ of $P'$, $k \leq |P'_i| \leq 2k - 1$.
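
One way to obtain a clustering as in Proposition 1 is to split every cluster with at least 2k rows into smaller pieces: the columns on which a subset of rows disagree are a subset of those on which the whole cluster disagrees, so the cost cannot grow. The sketch below is an assumption about how this can be done, not necessarily the authors' own argument; function names and the sample rows are hypothetical.

```python
from typing import List, Sequence

def split_large_clusters(clustering: List[List[Sequence]], k: int) -> List[List[Sequence]]:
    """Split every cluster into pieces whose sizes lie between k and 2k - 1."""
    result = []
    for cluster in clustering:
        rest = list(cluster)
        # Peel off chunks of exactly k rows while at least 2k rows remain.
        while len(rest) >= 2 * k:
            result.append(rest[:k])
            rest = rest[k:]
        result.append(rest)  # k <= len(rest) <= 2k - 1 when the input is feasible
    return result

def cost(clustering: List[List[Sequence]]) -> int:
    """c(P): for each cluster, size times number of disagreeing columns."""
    return sum(len(c) * sum(len(set(col)) > 1 for col in zip(*c)) for c in clustering)

# Hypothetical example: one cluster of 7 rows, k = 3.
big = [["000", "001", "010", "011", "100", "101", "110"]]
small = split_large_clusters(big, 3)
print([len(c) for c in small], cost(big), cost(small))  # [3, 4] 21 18
```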

We will study two restrictions of the k-anonymity problem. In the first restriction, denoted
by 3-ABP, the rows are over a binary alphabet $\Sigma = \{0_b, 1_b\}$ and k = 3. In the second
restriction, denoted by 4-AP(8), k = 4 and the rows are over an arbitrary alphabet and have
length 8.

In the remainder of the paper we will prove the APX-hardness of both restrictions, presenting
two different reductions from the Minimum Vertex Cover on Cubic Graphs (MVCC) problem, which
is known to be APX-hard [1]. Given a cubic graph $\mathcal{G} = (V, E)$, where $|V| = n$ and
$|E| = m$, the MVCC problem asks for a subset $C \subseteq V$ of minimum cardinality, such
that for each edge $(v_i, v_j) \in E$, at least one of $v_i$ or $v_j$ belongs to $C$.
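
For concreteness, the MVCC problem can be illustrated on the smallest cubic graph, the complete graph on four vertices. The brute-force search below is only an illustration of the problem statement (it is exponential and plays no role in the reductions); the function names are hypothetical.

```python
from itertools import combinations

def is_vertex_cover(edges, cover):
    """True if every edge has at least one endpoint in cover."""
    return all(u in cover or v in cover for u, v in edges)

def minimum_vertex_cover(vertices, edges):
    """Exhaustive search by increasing size; meant only for tiny graphs."""
    for size in range(len(vertices) + 1):
        for candidate in combinations(vertices, size):
            if is_vertex_cover(edges, set(candidate)):
                return set(candidate)

# K4: n = 4 vertices, m = 6 edges, every vertex has degree three.
V = [1, 2, 3, 4]
E = list(combinations(V, 2))
cover = minimum_vertex_cover(V, E)
print(len(cover))  # 3
```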

3 APX-hardness of 3-ABP 

In this section we will show that 3-ABP is APX-hard via an L-reduction from Minimum Vertex
Cover on Cubic Graphs (MVCC), which is known to be APX-hard [1]. From Proposition 1 it
follows Remark 2, which shows that we can restrict ourselves to solutions of 3-ABP where each
cluster contains at most 5 rows.

Remark 2. Let $R$ be an instance of 3-ABP, and let $P$ be a solution of 3-ABP over instance
$R$. Then we can compute in polynomial time a solution $P'$, with $c(P') \leq c(P)$, such
that for each cluster $P'_i$ of $P'$, $3 \leq |P'_i| \leq 5$.

Let $\mathcal{G} = (V, E)$ be an instance of MVCC. The reduction builds an instance $R$ of
3-ABP associating with each vertex $v_i$ a set of rows and with each edge
$e = (v_i, v_j) \in E$ a row $r_{i,j}$. Actually, starting from the cubic graph
$\mathcal{G}$, the reduction builds an intermediate multigraph, denoted as gadget graph
$\mathcal{VG}$; a snippet of a gadget graph obtainable through our reduction is represented in
Fig. 1. The reduction associates with each vertex $v_i$ of $\mathcal{G}$ a vertex gadget
$VG_i$ containing a core vertex gadget $CVG_i$ and some other vertices and edges called
respectively jolly vertices and jolly edges. More precisely, the vertex-set of a core vertex
gadget $CVG_i$ consists of the seven vertices $c_{i,1}, c_{i,2}, c_{i,3}, c_{i,4}, c_{i,5},
c_{i,6}, c_{i,7}$. The vertices $c_{i,1}$, $c_{i,2}$ and $c_{i,3}$ of $CVG_i$ are called
docking vertices. The edge-set of $CVG_i$ consists of nine edges between vertices of $CVG_i$
(see Fig. 1). Such a set of edges is defined as the set of core edges of $VG_i$. The
vertex-set of a vertex gadget consists of the seven vertices of $CVG_i$ and of three more
vertices $J_{i,1}, J_{i,2}, J_{i,3}$, called jolly vertices of $VG_i$. The edge-set of $VG_i$
consists of the edge-set of $CVG_i$ and of three sets of four parallel edges (see Fig. 1).
More precisely, for each docking vertex $c_{i,z}$ adjacent to a jolly vertex $J_{i,z}$, we
define a set $E^J_{i,z}$ of four parallel edges between $c_{i,z}$ and $J_{i,z}$. The set of
edges $E^J_i = \bigcup_{z \in \{1,2,3\}} E^J_{i,z}$ is called the set of jolly edges of
$VG_i$.

Figure 1: Gadgets for $v_i$, $v_j$, $(v_i, v_j)$

Each edge $(v_i, v_j)$ of $\mathcal{G}$ is encoded by an edge gadget $EG_{i,j}$ consisting of
a single edge that connects a docking vertex of $VG_i$ with one of $VG_j$, so that in the
resulting graph each docking vertex is an endpoint of exactly one edge gadget (this can be
achieved trivially as the original graph is cubic). The resulting graph, denoted by
$\mathcal{VG}$, is called gadget graph. An edge gadget is said to be incident on a vertex
gadget $VG_i$ if it is incident on a docking vertex of $VG_i$. In our reduction we will
associate a row with each edge of the gadget graph. Therefore 3-ABP is equivalent to
partitioning the edge set of the gadget graph into sets of at least three edges. Hence in what
follows we may use edges of $\mathcal{VG}$ to denote the corresponding rows. Before giving
some details, we present an overview of the reduction.

First, the input set $R$ of rows is defined, so that each row corresponds to an edge of the
gadget graph. Then, it is shown that, starting from a general solution, we can restrict
ourselves to a canonical solution, where there exist only two possible partitions of the rows
of a vertex gadget (and possibly some edge gadgets). Such partitions are denoted as type a and
type b solutions. Finally, the rows of a vertex gadget that belong to a type b (type a,
respectively) solution are related to vertices in the cover (not in the cover, respectively)
of the graph $\mathcal{G}$.

We are now able to introduce our reduction. All the rows in $R$ are the juxtaposition of
$n + 2$ blocks, where the $i$-th block, for $1 \leq i \leq n$, is associated with vertex
$v_i \in V$, the $(n+1)$-th block is called jolly block, and the $(n+2)$-th block is called
edge block. The first $n$ blocks are called vertex blocks, and each vertex block has size 21.
The jolly block has size $6n$, and the edge block has size $3n$.

The rows associated with edges of the gadget graph VG are obtained by introducing the 
following operations on rows (also called encoding operations). For simplicity's sake we will 
use a string-based notation. 

Definition 1 (Encoding operations). Let $VG_i$ be a vertex gadget, $CVG_i$ be a core vertex
gadget, $c_{i,j}$ be a vertex of $CVG_i$, $1 \leq i \leq n$, $1 \leq j \leq 7$, and let $r$ be
a row. Then the vertex encoding of $c_{i,j}$ applied to $r$, denoted by
$\text{v-enc}_{i,j}(r)$, is obtained by assigning $1_b$ to the positions $3j-2$, $3j-1$, $3j$
of the $i$-th block of $r$ (and leaving all other entries as in $r$). The gadget encoding of
$VG_i$ applied to $r$, denoted by $\text{g-enc}_i(r)$, is obtained by assigning $1_b$ to the
positions $3i-2$, $3i-1$, $3i$ of the edge block of $r$ (and leaving all other entries as in
$r$). Finally, let $J_{i,x}$ be a jolly vertex of $VG_i$, $1 \leq i \leq n$,
$1 \leq x \leq 3$, and let $r$ be a row; then the jolly encoding of $J_{i,x}$ applied to $r$,
denoted by $\text{j-enc}_{i,x}(r)$, is obtained by assigning $1_b$ to the positions
$6(i-1)+x$, $6(i-1)+x+1$ of the jolly block of row $r$.

| Operation | Positions of the i-th vertex block set to $1_b$ | Positions of the edge block set to $1_b$ | Positions of the jolly block set to $1_b$ |
|---|---|---|---|
| $\text{v-enc}_{i,j}(r)$ | $3j-2$, $3j-1$, $3j$ | | |
| $\text{g-enc}_i(r)$ | | $3i-2$, $3i-1$, $3i$ | |
| $\text{j-enc}_{i,x}(r)$ | | | $6(i-1)+x$, $6(i-1)+x+1$ |

Table 1: Summary of the encoding operations

Notice that the vertex encoding and the gadget encoding operations set to $1_b$ at most 3
entries of any row, while the jolly encoding operation sets to $1_b$ at most 2 entries of any
row. Let $c_{i,x}$ be a docking vertex of $VG_i$, and let $c_{i,y}$, $c_{i,z}$ be the two core
vertices of $VG_i$ adjacent to $c_{i,x}$, let $(c_{i,x}, c_{i,y})$ be a core edge, and let
$J_{i,x}$ be a jolly vertex adjacent to $c_{i,x}$. Then the row $r_{i,x,y}$ associated with
$(c_{i,x}, c_{i,y})$ is
$\text{g-enc}_i(\text{v-enc}_{i,y}(\text{v-enc}_{i,x}(0_b^{30n})))$, while each row associated
with a jolly edge $(c_{i,x}, J_{i,x})$, denoted by $r_{i,x,y,z}$ (for $1 \leq z \leq 4$), is
$\text{j-enc}_{i,x}(\text{g-enc}_i(\text{v-enc}_{i,z}(\text{v-enc}_{i,y}(\text{v-enc}_{i,x}(0_b^{30n})))))$.
Each row associated with a jolly edge is called jolly row and the set of the 4 jolly rows
incident to vertex $c_{i,x}$ is called jolly row set of $c_{i,x}$. Finally, let
$EG_{i,j} = (c_{i,x}, c_{j,y})$ be an edge gadget. The row $r_{i,j,x,y}$ associated with
$EG_{i,j}$ is
$\text{g-enc}_j(\text{g-enc}_i(\text{v-enc}_{j,y}(\text{v-enc}_{i,x}(0_b^{30n}))))$. In Table 2
the rows associated with the various edges are summarized.



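| Edge | Associated row |
|---|---|
| Core edge $(c_{i,x}, c_{i,y})$ | $\text{g-enc}_i(\text{v-enc}_{i,y}(\text{v-enc}_{i,x}(0_b^{30n})))$ |
| Jolly edge $(c_{i,x}, J_{i,x})$ | $\text{j-enc}_{i,x}(\text{g-enc}_i(\text{v-enc}_{i,z}(\text{v-enc}_{i,y}(\text{v-enc}_{i,x}(0_b^{30n})))))$ |
| Edge gadget $EG_{i,j}$ | $\text{g-enc}_j(\text{g-enc}_i(\text{v-enc}_{j,y}(\text{v-enc}_{i,x}(0_b^{30n}))))$ |

Table 2: Encodings of the edges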
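
The encodings can be sketched in Python as operations on 0/1 lists. This is a minimal illustration, under the assumption that the docking vertex $c_{1,1}$ is adjacent to $c_{1,4}$ and $c_{1,5}$ (the second adjacency is hypothetical, since the exact edge set of the core gadget is given only in Fig. 1); the function names are not from the paper.

```python
def zero_row(n):
    """All-zero row: n vertex blocks of 21 entries, a jolly block of 6n, an edge block of 3n."""
    return [0] * (21 * n + 6 * n + 3 * n)

def set_ones(row, offset, positions):
    """Return a copy of row with the given 1-based block positions set to 1."""
    out = list(row)
    for p in positions:
        out[offset + p - 1] = 1
    return out

def v_enc(row, n, i, j):
    """Vertex encoding of c_{i,j}: positions 3j-2, 3j-1, 3j of the i-th vertex block."""
    return set_ones(row, 21 * (i - 1), (3 * j - 2, 3 * j - 1, 3 * j))

def j_enc(row, n, i, x):
    """Jolly encoding of J_{i,x}: positions 6(i-1)+x, 6(i-1)+x+1 of the jolly block."""
    return set_ones(row, 21 * n, (6 * (i - 1) + x, 6 * (i - 1) + x + 1))

def g_enc(row, n, i):
    """Gadget encoding of VG_i: positions 3i-2, 3i-1, 3i of the edge block."""
    return set_ones(row, 27 * n, (3 * i - 2, 3 * i - 1, 3 * i))

def core_row(n, i, x, y):
    return g_enc(v_enc(v_enc(zero_row(n), n, i, x), n, i, y), n, i)

def jolly_row(n, i, x, y, z):
    r = v_enc(v_enc(v_enc(zero_row(n), n, i, x), n, i, y), n, i, z)
    return j_enc(g_enc(r, n, i), n, i, x)

def edge_gadget_row(n, i, x, j, y):
    r = v_enc(v_enc(zero_row(n), n, i, x), n, j, y)
    return g_enc(g_enc(r, n, i), n, j)

def hamming(r1, r2):
    return sum(a != b for a, b in zip(r1, r2))

n = 2
e1 = core_row(n, 1, 1, 4)            # core edge (c_{1,1}, c_{1,4})
e2 = core_row(n, 1, 1, 5)            # core edge (c_{1,1}, c_{1,5}), assumed adjacency
jr = jolly_row(n, 1, 1, 4, 5)        # a jolly row of docking vertex c_{1,1}
eg = edge_gadget_row(n, 1, 1, 2, 1)  # edge gadget between c_{1,1} and c_{2,1}
print(hamming(e1, e2), hamming(e1, jr), hamming(eg, e1), hamming(eg, jr))  # 6 5 9 14
```

Each row has length 30n, and the distances printed for this small instance are the differences in 1-entries between the encodings applied to each pair of rows.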



For example consider the row $r_{i,1,4}$ associated with the core edge $(c_{i,1}, c_{i,4})$.
Observe that $\text{v-enc}_{i,1}$ sets to $1_b$ the first three positions of the $i$-th block
of $r_{i,1,4}$, while $\text{v-enc}_{i,4}$ sets to $1_b$ the positions 10, 11, 12 of the
$i$-th block of $r_{i,1,4}$. Finally, $\text{g-enc}_i$ sets to $1_b$ the positions $3i-2$,
$3i-1$ and $3i$ of the edge block of $r_{i,1,4}$. Edge $(c_{i,1}, c_{i,4})$ is associated
with the following row:

$$\underbrace{0_b \cdots 0_b}_{\text{blocks } 1, \dots, i-1}\;
\underbrace{1_b1_b1_b\, 0_b0_b0_b\, 0_b0_b0_b\, 1_b1_b1_b\, 0_b \cdots 0_b}_{\text{block } i}\;
\underbrace{0_b \cdots 0_b}_{\text{blocks } i+1, \dots, n}\;
\underbrace{0_b \cdots 0_b}_{\text{jolly block}}\;
\underbrace{0_b \cdots 0_b\, 1_b1_b1_b\, 0_b \cdots 0_b}_{\text{edge block}}$$

Observe that by construction only jolly rows may have a $1_b$ in a position of the jolly
block. It is immediate to notice that clustering together three or more jolly rows associated
with parallel edges has cost 0. We recall that we may use edges of $\mathcal{VG}$ to denote
the corresponding rows.

Proposition 3. Let $e_1, e_2$ be two edges of $CVG_i$, let $e_3$ be an edge of $VG_j$ (with
$i \neq j$), let $e_j$ be a jolly edge of $VG_i$, let $e_5$ be a jolly edge of $VG_z$, and let
$EG_{i,x}$, $EG_{j,l}$ be two edge gadgets. Then:

1. $H(e_1, e_3) \geq 18$;

2. $H(EG_{i,x}, e_j), H(EG_{j,l}, e_j) \geq 14$;

3. If $e_1$ and $e_2$ are incident on the same vertex, then $H(e_1, e_2) = 6$;

4. If $e_1$, $e_2$ are not incident on the same vertex, then $H(e_1, e_2) = 12$;

5. If $e_1$ and $e_j$ are incident on the same vertex, then $H(e_1, e_j) = 5$;

6. If $e_1$ and $e_j$ are not incident on the same vertex, then $H(e_1, e_j) \geq 11$;

7. If $e_1$ and $EG_{i,x}$ are incident on the same vertex, then $H(EG_{i,x}, e_1) = 9$;

8. If $e_1$ and $EG_{i,x}$ are not incident on the same vertex, then
$H(EG_{i,x}, e_1) \geq 15$;

9. $H(EG_{i,x}, EG_{j,l}) \geq 18$;

10. If $i, x, j, l$ are all distinct (i.e. the two edge gadgets are not incident on the same
vertex gadget), then $H(EG_{i,x}, EG_{j,l}) = 24$;

11. If $e_j$, $e_5$ are not in the same jolly set, then $H(e_j, e_5) \geq 12$;

12. Let $r$ be a row not incident on a common vertex with $e_j$. Then $H(e_j, r) \geq 11$.

Proof. Observe that all the cases can be easily proved by observing that each row is obtained
by applying 3 kinds of encoding operations, where the vertex encoding and gadget encoding
assign value $1_b$ to three positions, while the jolly encoding assigns $1_b$ to two
positions. We now prove the various cases, following the order of the statement.

1. Since $e_1$ and $e_3$ are incident on different vertices, the associated rows are the
result of applying once the encoding operations with different values of $i$. Therefore the
positions where a $1_b$ is set are disjoint. Since there are at least 12 such positions of the
vertex blocks and 6 positions of the edge block, we obtain $H(e_1, e_3) \geq 18$.

2. Notice that there are 2 positions of the jolly block that are set to $1_b$ as a result of
applying the jolly encoding to $e_j$, while the whole block is set to $0_b$ in $EG_{i,x}$.
Since $EG_{i,x}$ is subject to two gadget encodings, while $e_j$ is subject only to one gadget
encoding, $EG_{i,x}$ and $e_j$ have 3 different entries also in the edge block. Moreover at
most two of the overall five vertex encoding operations have the same arguments (those
corresponding to a shared docking vertex), resulting in an additional 9 different entries.

3. Since $e_1$ and $e_2$ are incident on a common vertex, they share a gadget encoding
operation and a vertex encoding operation, therefore there are two different vertex encoding
operations that result in 6 different entries.

4. Since $e_1$ and $e_2$ are not incident on a common vertex, but are in the same vertex
gadget, they share only a gadget encoding operation, therefore there are four different vertex
encoding operations that result in 12 different entries.

5. Since $e_1$ and $e_j$ are incident on a common vertex, they share a gadget encoding
operation and two vertex encoding operations, while they differ for a jolly encoding operation
and a vertex encoding operation, resulting in 5 different entries.

6. Since $e_1$ and $e_j$ are not incident on a common vertex, they share a gadget encoding
operation (since by hypothesis $e_1$ and $e_j$ are in the same vertex gadget), and at most one
vertex encoding operation (if $e_1$ is incident on a vertex adjacent to the docking vertex on
which $e_j$ is also incident), while they differ for a jolly encoding operation and three
vertex encoding operations, resulting in at least 11 different entries.

7. Since $e_1$ and $EG_{i,x}$ are incident on a common (docking) vertex, they share a gadget
encoding operation and a vertex encoding operation, while they differ for a gadget encoding
operation and two vertex encoding operations, resulting in 9 different entries.

8. Since $e_1$ and $EG_{i,x}$ are not incident on a common vertex, they share a gadget
encoding operation (since $EG_{i,x}$ is incident on a docking vertex of $VG_i$), while they
differ for a gadget encoding operation and four vertex encoding operations, resulting in 15
different entries.

9. Since $EG_{i,x}$ and $EG_{j,l}$ are not incident on a common vertex, they might share a
gadget encoding operation (if the two edge gadgets are incident on the same vertex gadget),
while they differ for four vertex encoding operations and two gadget encoding operations,
resulting in at least 18 different entries.

10. Since $EG_{i,x}$ and $EG_{j,l}$ are not incident on a common vertex gadget, they share no
encoding operations, resulting in 24 different entries.

11. Since $e_j$ and $e_5$ are not in the same jolly set, they might share a gadget encoding
operation and a vertex encoding operation (if the two jolly edges are in the same vertex
gadget), while they differ for four vertex encoding operations, resulting in at least 12
different entries.

12. The result follows from the previous cases (cases 2, 6 and 11), as row $r$ is either a row
of a $CVG_z$, for some $z$, a jolly row in a jolly row set of a different vertex of
$\mathcal{VG}$, or an edge gadget.

□

The cost of a solution $S$ is specified by introducing the notion of virtual cost of a single
row $r$ of $R$. Let $S$ be a solution of 3-ABP, and let $C$ be the cluster of $S$ to which $r$
belongs. Let $r$ be a non-jolly row; we define the virtual cost of $r$ in the solution $S$,
denoted as $virt_S(r)$, as the cost of $C$ divided by the number of non-jolly rows in $C$.
Otherwise, if $r$ is a jolly row, then $virt_S(r) = 0$. Given the above notion, observe that
the cost $c(C)$ of set $C$ is equal to $\sum_{r \in C} virt_S(r)$ and that for a solution $S$,
the cost $c(S)$ of $S$ is equal to $\sum_{r \in R} virt_S(r)$.
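
The notion of virtual cost can be sketched as follows. The function names and the three toy rows are hypothetical and are not gadget rows; the sketch only shows how the cost of a cluster is charged to its non-jolly rows.

```python
from fractions import Fraction

def cluster_cost(rows):
    """Cluster size times the number of columns on which the rows disagree."""
    return len(rows) * sum(len(set(col)) > 1 for col in zip(*rows))

def virtual_costs(cluster):
    """cluster: list of (row, is_jolly) pairs; returns one virtual cost per row."""
    rows = [row for row, _ in cluster]
    non_jolly = sum(1 for _, is_jolly in cluster if not is_jolly)
    if non_jolly == 0:
        # Only jolly rows: every virtual cost is 0 by definition.
        return [Fraction(0)] * len(cluster)
    share = Fraction(cluster_cost(rows), non_jolly)
    return [Fraction(0) if is_jolly else share for _, is_jolly in cluster]

# Hypothetical cluster: two non-jolly rows and one jolly row.
C = [("1100", False), ("1010", False), ("1111", True)]
costs = virtual_costs(C)
print(costs, sum(costs))  # [Fraction(9, 2), Fraction(9, 2), Fraction(0, 1)] 9
```

Here the cluster cost 9 is split between the two non-jolly rows, so the virtual costs sum to the cost of the cluster.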

In the following we will consider only canonical solutions of 3-ABP, that is solutions where
the rows of each vertex gadget $VG_i$, and possibly of the edge gadgets incident on $VG_i$,
are clustered into type a and type b solutions constructed as follows.

The type a solution defines the partition of the rows for vertex gadget $VG_i$ and consists of
six clusters: three clusters of rows of $CVG_i$, each one made of the three edges incident on
vertex $v$, where $v$ is one of the three vertices $c_{i,4}$, $c_{i,5}$ and $c_{i,7}$, and
three more clusters, each one consisting of the jolly rows associated with one of the three
docking vertices of $VG_i$.

Figure 2: A type a solution for the rows associated with $VG_i$, where the dashed lines
represent borders among clusters. Recall that each edge $e$ of $VG_i$ corresponds to a row.

The type b solution defines the partition of the rows for a vertex gadget $VG_i$ and some edge
gadgets incident on $VG_i$. It consists of four clusters containing rows of $CVG_i$. One of
them consists of the three edges incident on $c_{i,6}$. The remaining three clusters are
associated with the three docking vertices of $VG_i$. For each docking vertex $c_{i,x}$, the
cluster associated with $c_{i,x}$ consists of the two core edges of $CVG_i$ that are incident
on $c_{i,x}$, together with either the edge gadget incident on $c_{i,x}$ or one jolly edge
incident on $c_{i,x}$. Finally, there are three more clusters, each one consisting of all
remaining jolly edges associated with parallel edges incident on one of the three docking
vertices of $VG_i$. Notice that in a type b solution each cluster associated with a docking
vertex may contain an edge gadget or not; the only requirement is that at least one of the
clusters contains an edge gadget. Notice that type a and type b solutions cluster together
edges incident on a common vertex (by an abuse of language, we will call such a cluster
canonical): the common vertex of a canonical cluster is called the center of the cluster.

Proposition 4. Let $S$ be a canonical solution of an instance of 3-ABP associated with an
instance of MVCC, and let $VG_i$, $VG_j$ be two vertex gadgets such that the rows of $VG_i$
are clustered in a type a solution in $S$ and the rows of $VG_j$ are clustered in a type b
solution in $S$. Then each edge gadget has a virtual cost of 12 in $S$, the rows of $VG_i$
have a total cost of 81, while the rows of $VG_j$ have a total cost of 99.

Proof. Let $EG_{i,j}$ be an edge gadget of $\mathcal{VG}$. Observe that in a canonical
solution each edge gadget belongs to a type b solution. Consider a type b solution for $VG_i$
containing $EG_{i,j}$. By definition of type b solution, $EG_{i,j}$ is co-clustered with two
rows of $CVG_i$, so that those rows are incident on a common docking vertex with $EG_{i,j}$.
It follows that 12 entries (9 of the vertex blocks, 3 of the edge block) are deleted in each
of these rows. Now, consider a cluster of a type b solution of $VG_i$ consisting of two rows
$r_1$, $r_2$ of $CVG_i$ and a jolly row incident on a common docking vertex. Then 8 entries (6
of the vertex blocks, 2 of the jolly block) are deleted in each of these rows, hence this
cluster has a total cost of 24. Since the virtual cost of the jolly row is 0, each of $r_1$,
$r_2$ has virtual cost 12. A type b solution of $VG_i$ contains four clusters: three clusters
containing rows incident on the docking vertices (as described above) and a cluster of three
rows incident on $c_{i,6}$, which has a total virtual cost equal to 27 (9 for each row
incident on $c_{i,6}$). Each row of $CVG_i$ incident on a docking vertex has a virtual cost of
12 in a type b solution, hence the total virtual cost of the rows in $CVG_i$ of a type b
solution is $27 + 12 \cdot 6 = 99$.

Figure 3: A type b solution for the rows associated with $VG_i$ and $EG_{i,j}$, where the
dashed lines represent borders among clusters. Recall that each edge $e$ corresponds to a row.

Consider a $CVG_i$ associated with a type a solution. Observe that a type a solution consists
of three clusters of rows of $CVG_i$, each of cost 27. Indeed, each such cluster consists of
three rows incident on a common vertex. Since the clusters of jolly rows have cost 0, the
rows of $VG_i$ have a total cost of $3 \cdot 27 = 81$. □
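
The cluster costs used in the proof of Proposition 4 can be re-derived mechanically. The sketch below represents each row by the set of positions holding $1_b$, following the encoding operations; the adjacencies (docking vertex $c_{1,1}$ adjacent to $c_{1,4}$, $c_{1,5}$, and $c_{1,6}$ adjacent to $c_{1,4}$, $c_{1,5}$, $c_{1,7}$) are assumptions made for illustration, since the exact core edges are given only in the figures. Only the fact that the three rows of a cluster share one endpoint and have distinct other endpoints matters for the counts.

```python
def v(i, j):
    """Positions set by the vertex encoding of c_{i,j}, tagged with the vertex block."""
    return {("V", i, p) for p in (3 * j - 2, 3 * j - 1, 3 * j)}

def g(i):
    """Positions set by the gadget encoding of VG_i in the edge block."""
    return {("E", p) for p in (3 * i - 2, 3 * i - 1, 3 * i)}

def jl(i, x):
    """Positions set by the jolly encoding of J_{i,x} in the jolly block."""
    return {("J", 6 * (i - 1) + x), ("J", 6 * (i - 1) + x + 1)}

def core(i, x, y):
    return v(i, x) | v(i, y) | g(i)

def jolly(i, x, y, z):
    return v(i, x) | v(i, y) | v(i, z) | g(i) | jl(i, x)

def edge_gadget(i, x, j, y):
    return v(i, x) | v(j, y) | g(i) | g(j)

def cluster_cost(rows):
    """A column is suppressed iff it holds 1_b in some but not all rows of the cluster."""
    suppressed = set.union(*rows) - set.intersection(*rows)
    return len(rows) * len(suppressed)

star = [core(1, 6, 4), core(1, 6, 5), core(1, 6, 7)]                 # three edges with center c_{1,6}
with_jolly = [core(1, 1, 4), core(1, 1, 5), jolly(1, 1, 4, 5)]       # docking cluster plus a jolly row
with_gadget = [core(1, 1, 4), core(1, 1, 5), edge_gadget(1, 1, 2, 1)]  # docking cluster plus edge gadget

print(cluster_cost(star), cluster_cost(with_jolly), cluster_cost(with_gadget))  # 27 24 36
print(3 * 27, 27 + 6 * 12)  # 81 99
```

The three values correspond to 9, 8 and 12 suppressed entries per row, in agreement with the counts in the proof.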



In the following we state two basic results that will be used to show the L-reduction from
MVCC to 3-ABP: (i) each solution $S$ of 3-ABP can be modified in polynomial time into a
canonical solution $S'$ whose cost is at most that of $S$ (Lemma 16); (ii) the graph
$\mathcal{G}$ has a vertex cover of size $p$ iff the 3-ABP problem has a canonical solution of
cost $99 \cdot p + 81 \cdot (n - p) + 12m$ (see Theorem 17; we recall that 81 is the total
virtual cost of the rows of a type a solution, and 99 is the total virtual cost of the rows of
a vertex gadget in a type b solution). We will first introduce some basic Lemmas that will
help in excluding some possible solutions.
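
As a small worked instance of the cost formula in (ii), the sketch below evaluates it for the complete graph on four vertices, which is cubic with n = 4 and m = 6 and whose minimum vertex cover has size 3. The function name is hypothetical; the formula is the one stated above.

```python
def canonical_cost(n: int, m: int, p: int) -> int:
    """Cost of a canonical solution with p type b gadgets, n - p type a gadgets, m edge gadgets."""
    return 99 * p + 81 * (n - p) + 12 * m

# K4: n = 4, m = 6, minimum vertex cover of size p = 3.
print(canonical_cost(4, 6, 3))  # 450
# Each additional vertex in the cover increases the cost by 99 - 81 = 18.
print(canonical_cost(4, 6, 4) - canonical_cost(4, 6, 3))  # 18
```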

Lemma 5. Let $S$ be a solution of an instance of 3-ABP associated with an instance of MVCC
and let $C$ be a cluster of $S$ consisting of rows of $CVG_i$. Then $virt_S(r) \geq 9$ for
each row $r$ of $C$, and $virt_S(r) \geq 12$ if $C$ is not a canonical cluster.

Proof. First notice that by construction $\mathcal{VG}$ does not contain any cycle of length
4. It follows that if $C$ is not a canonical cluster, then $C$ contains two rows $e_1$ and
$e_2$ not incident on a common vertex. By case 4 of Prop. 3, $e_1$ and $e_2$ have Hamming
distance 12.

Assume now that $C$ is a canonical cluster, and let $W$ be the set of vertices incident on
the edges of $C$, except for the center of $C$. By definition of canonical cluster, $C$
contains no cycles, moreover $|C|, |W| \geq 3$, therefore for each vertex $v_{i,x} \in W$,
there exists one edge in $C$ not incident on $v_{i,x}$. Since the vertex encoding
$\text{v-enc}_{i,x}$ is applied only to edges incident on $v_{i,x}$, the three entries set to
$1_b$ by $\text{v-enc}_{i,x}$ are deleted in each row of $C$, for each $v_{i,x} \in W$, and
the lemma follows. □

Lemma 6. Let $S$ be a solution of an instance of 3-ABP associated with an instance of MVCC
and let $C$ be a cluster of $S$ consisting of rows of $VG_i$, such that $C$ contains at least
one jolly row of $VG_i$ and at least one row of $CVG_i$. Then the virtual cost of each
non-jolly row in $C$ is at least 12.

Proof. Assume first that $C$ contains exactly one non-jolly row $r_1$. Then by case 5 of
Prop. 3, $H(r_1, r_j) \geq 5$ for each jolly row $r_j$ and, since $|C| \geq 3$, the cost of
$C$ is at least 15. Since $r_1$ is the only non-jolly row of $C$, the virtual cost of $r_1$
is at least 15. Assume now that $C$ contains at least two non-jolly rows $r_1$, $r_2$, and let
$r_j$ be a jolly row in $C$. If $C$ contains exactly two rows of $CVG_i$, by cases 3, 4 of
Prop. 3, $H(r_1, r_2) \geq 6$ and by construction there are two positions $h_1$, $h_2$ in the
jolly block where $r_1[h_z] = r_2[h_z] = 0_b$, while $r_j[h_z] = 1_b$, with
$z \in \{1, 2\}$. Hence the total cost of $C$ is at least $8|C|$. Since $C$ contains exactly
two non-jolly rows $r_1$, $r_2$, the virtual cost of $r_1$ and $r_2$ is at least $8|C|/2$.
Since $|C| \geq 3$, the virtual cost of $r_1$ and $r_2$ will be at least 12.

Assume that $C$ contains more than two non-jolly rows. Then $C$ contains a set $C'$, where
$C'$ consists of at least 3 rows of $CVG_i$. Notice that $C'$ can be a cluster of a feasible
solution of 3-ABP, therefore Lemma 5 applies also to $C'$ and an immediate consequence is that
the same 9 entries in the vertex blocks must be suppressed also in the same positions in $C$.
Furthermore, by construction, there exist two positions $h_1$, $h_2$ in the jolly block where
the non-jolly rows have $0_b$, while some of the jolly rows have value $1_b$. Hence the total
cost of $C$ is at least $11|C|$ and the virtual cost of each non-jolly row in $C$ is at least
$11|C|/(|C| - 1)$. But Remark 2 implies $|C| \leq 5$, therefore $11|C|/(|C| - 1) \geq 12$ and
the virtual cost of each non-jolly row in $C$ is at least 12. □

Lemma 7. Let $S$ be a solution of an instance of 3-ABP associated with an instance of MVCC
and let $C$ be a cluster of $S$ containing a row of $VG_i$. Then the virtual cost of each
non-jolly row of $C$ is at least 9.

Proof. Notice that if $C$ contains a jolly row, then the lemma is a consequence of Prop. 3
(cases 2 and 12) and Lemmas 5 and 6. Hence assume that $C$ contains no jolly row. If $C$
contains at least two edge gadgets then, by case 9 of Prop. 3, the virtual cost of each
non-jolly row of $C$ is at least 18, therefore we can assume that there is exactly one edge
gadget in $C$.

If $C$ is not a canonical cluster, there are two rows that are not incident on a common
vertex, therefore by cases 1, 4, 8 of Prop. 3 and by construction of $VG_i$, each non-jolly
row of $C$ has a virtual cost of at least 12. The final case that we have to consider for $C$
is when $C$ contains an edge gadget and two edges of a core vertex gadget and all edges are
incident on a common vertex: in this case we can apply case 7 of Prop. 3 to obtain the
lemma. □

An immediate consequence of Lemma 7 and of the construction of $VG_i$ is that a type a
solution is the optimal solution for the rows associated with edges of $VG_i$.

Lemma 8. Let $S$ be a solution of an instance of 3-ABP associated with an instance of MVCC
and let $C$ be a cluster of $S$ containing exactly two edge gadgets $EG_1$ and $EG_2$. Then
each of the virtual costs $virt_S(EG_1)$, $virt_S(EG_2)$ is at least 21. If the edge gadgets
are not incident on a common vertex gadget, then $virt_S(EG_1), virt_S(EG_2) \geq 27$.

Proof. Notice that all $1_b$'s in a vertex block of $EG_1$ correspond to $0_b$'s of $EG_2$.
The same fact holds for 3 $1_b$'s in the edge block of $EG_1$, if $EG_1$ and $EG_2$ are
incident on a common vertex gadget, otherwise 6 $1_b$'s in the edge block of $EG_1$ are
deleted. By symmetry of $EG_1$, $EG_2$ the number of deleted columns is at least 18, when
$EG_1$ and $EG_2$ are incident on a common vertex gadget, otherwise the number of deleted
columns is at least 24.

Let $r_3 \in C$ be different from $EG_1$, $EG_2$. Since $r_3$ is not an edge gadget, there is
a vertex block of $r_3$ containing 6 $1_b$'s, while both $EG_1$ and $EG_2$ have at most 3
$1_b$'s in that block. It is immediate to notice from the construction of $EG_1$ and $EG_2$
that this fact leads to at least 3 additional columns that must be deleted. □

Lemma 9. Let $S$ be a solution of an instance of 3-ABP associated with an instance of MVCC
and let $C$ be a cluster of $S$ containing two edge gadgets $EG_1$, $EG_2$ that are not
incident on a common vertex gadget and a row $r$ belonging to a vertex gadget, such that $r$
is not adjacent to $EG_1$ nor to $EG_2$. Then
$virt_S(EG_1), virt_S(EG_2), virt_S(r) \geq 30$.

Proof. Since $EG_1$ and $EG_2$ are not incident on a common vertex gadget, by case 10 of
Prop. 3 we know that $H(EG_1, EG_2) \geq 24$. Now consider the row $r$ and assume w.l.o.g.
that $r$ belongs to vertex gadget $VG_i$. Since $r$ is not adjacent to $EG_1$ nor to $EG_2$,
there are at least 6 positions in the $i$-th block where $EG_1$ and $EG_2$ both have value
$0_b$, while $r$ has value $1_b$. □

Lemma 10. Let $S$ be a solution of an instance of 3-ABP associated with an instance of MVCC
and let $C$ be a cluster of $S$ containing three edge gadgets $EG_1$, $EG_2$, $EG_3$. Then
each of the virtual costs $virt_S(EG_1)$, $virt_S(EG_2)$, $virt_S(EG_3)$ is at least 27. If
there is no pair of edge gadgets in $\{EG_1, EG_2, EG_3\}$ incident on a common vertex gadget,
then $virt_S(EG_1), virt_S(EG_2), virt_S(EG_3) \geq 36$.

Proof. Observe that $EG_1$, $EG_2$, $EG_3$ have minimum virtual cost either when they are all
incident on the same vertex gadget or when the set of vertex gadgets on which $EG_1$, $EG_2$,
$EG_3$ are incident consists of three vertex gadgets (that is, $EG_1$, $EG_2$, $EG_3$ encode a
cycle of length 3). Therefore 9 entries for each of $EG_1$, $EG_2$, $EG_3$ are deleted in the
edge block, while 18 entries of the vertex blocks are deleted for each of $EG_1$, $EG_2$,
$EG_3$, since $EG_1$, $EG_2$, $EG_3$ represent edges incident on a set of 6 different docking
vertices. Hence the virtual cost of each $EG_i$, $i \in \{1, 2, 3\}$, is at least 27.

Observe that when there is no pair of edge gadgets in $\{EG_1, EG_2, EG_3\}$ incident on the
same vertex gadget, the positions of the edge block with value $1_b$ in $EG_1$, $EG_2$,
$EG_3$ are all different, hence at least 18 entries of the edge block are deleted for each of
$EG_1$, $EG_2$, $EG_3$. Together with the 18 entries of the vertex blocks, the virtual cost of
each $EG_i$, $i \in \{1, 2, 3\}$, is at least 36. □

Lemma 11. Let $S$ be a solution of an instance of 3-ABP associated with an instance of MVCC
and let $C$ be a cluster of $S$ containing more than three edge gadgets. Then the virtual cost
of each edge gadget in $S$ is at least 36.

Proof. Consider 4 edge gadgets in $C$: $EG_1$, $EG_2$, $EG_3$, $EG_4$. First observe that 24
entries of the vertex blocks are deleted for each of $EG_1$, $EG_2$, $EG_3$, $EG_4$, since
$EG_1$, $EG_2$, $EG_3$, $EG_4$ represent edges incident on a set of 8 different docking
vertices.

A simple argument shows that at least two of such edge gadgets are not incident on a common
vertex gadget. Indeed, the set of vertex gadgets on which $EG_1$, $EG_2$, $EG_3$, $EG_4$ are
incident contains at least four vertex gadgets, for otherwise two edge gadgets must be
incident on the same two vertex gadgets. Hence 12 entries of the edge block will be deleted
from each row in $C$. □

Lemma 12. Let S be a solution of an instance of 3-ABT associated with an instance of MVCC
and let C be a cluster of S containing an edge gadget $EG_{i,j}$ incident on vertex gadgets $VG_i$ and
$VG_j$, and two rows $r_x$, $r_y$ adjacent to $EG_{i,j}$, where $r_x$ belongs to $VG_i$ and $r_y$ belongs to
$VG_j$. Then the cost of C in S is at least $18|C|$.

Proof. It is an immediate consequence of case 1 of Prop. 3. □

Lemma 13. Let S be a solution of an instance of 3-ABT associated with an instance of
MVCC and let C be a cluster of S containing an edge gadget $EG_{i,j}$ and a jolly row $j_i$. Then
$virt_S(EG_{i,j}) \geq 18$.

Proof. Observe that, by case 2 of Prop. 3, $H(EG_{i,j}, j_i) \geq 14$. Let J be the subset of C
consisting of all rows of C that are not jolly rows. We can assume that $|J| \leq 4$, as $|C| \leq 5$.
If $|J| = 4$, then there is at least one row r in J such that there exist at least 3 positions where
$EG_{i,j}$ and $j_i$ have the same value, while r has a different value. Hence the virtual cost of
$EG_{i,j}$ is at least 18. If $|J| \leq 3$, then the virtual cost of $EG_{i,j}$ is again at least 18 and
the lemma holds. □

The following Lemma 14 is a consequence of cases 1, 2, 7, 8 of Prop. 3 and the construction
of the gadget graph.

Lemma 14. Let S be a solution of an instance of 3-ABT associated with an instance of MVCC
and let C be a cluster of S with at least an edge gadget $EG_{i,j}$. Then $virt_S(EG_{i,j}) \geq 12$.

Now, we will show our key transformation of a generic solution into a canonical solution
without increasing its cost. The proof is based on the fact that, whenever a solution S is not
a canonical one, it can be transformed into a canonical one by applying Alg. 1.

Let us denote by $S_1$ and $S_2$, respectively, the solution before and after applying Alg. 1.
Observe that, by construction, in the solution $S_2$ computed by Alg. 1 each edge gadget belongs
to a type b solution.



Algorithm 1: ComputeCanonical($S_1$)
Data: a solution $S_1$ consisting of the set $\{C_1, \ldots, C_k\}$ of clusters
 1 Unmark all edge gadgets and all vertex gadgets;
 2 $S_2 \leftarrow \emptyset$;
 3 while there is a cluster C in $\{C_1, \ldots, C_k\}$ with an unmarked edge gadget do
 4     $U(C) \leftarrow$ the set of unmarked edge gadgets in C;
 5     $V(C) \leftarrow$ a smallest possible set of vertex gadgets such that each edge gadget in
       $U(C)$ has at least one endpoint in $V(C)$;
       /* $|C| \leq 5$ by Remark 2, hence we can compute $V(C)$ in polynomial time.
          We assume that if a cluster C contains only an edge gadget $EG_{i,j}$ and
          rows of vertex gadget $VG_i$, then $V(C) = \{VG_i\}$. */
 6     $E' \leftarrow$ all unmarked edge gadgets incident on some vertices of $V(C)$;
 7     Add to $S_2$ a type b solution for all vertex gadgets in $V(C)$ and all edge gadgets in $E'$;
       /* Notice that $E' \supseteq U(C)$ */
 8     Mark the edge gadgets in $E'$ and the vertex gadgets in $V(C)$;
 9 end
10 Add to $S_2$ a type a solution for each unmarked vertex gadget;
11 return $S_2$
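
The marking logic of Alg. 1 can be illustrated with a small Python sketch. It is not the authors' implementation and rests on simplifying assumptions: each edge gadget is modelled as a pair of vertex identifiers, each cluster of the input solution is reduced to the list of edge gadgets it contains, the tie-breaking rule stated in the comment of line 5 is ignored, and the function names `smallest_cover` and `compute_canonical_marks` are hypothetical. The sketch only records which vertex gadgets would receive a type b or a type a solution; it does not build any rows.

```python
from itertools import combinations


def smallest_cover(edges):
    """Brute-force a smallest set of vertices touching every edge in `edges`.

    This is affordable because a cluster holds at most 5 rows, hence at most
    5 edge gadgets.
    """
    vertices = sorted({v for edge in edges for v in edge})
    for size in range(len(vertices) + 1):
        for candidate in combinations(vertices, size):
            chosen = set(candidate)
            if all(u in chosen or v in chosen for (u, v) in edges):
                return chosen
    return set(vertices)


def compute_canonical_marks(all_vertices, all_edges, clusters_edges):
    """Return (type_b_vertices, type_a_vertices) following lines 1-10 of Alg. 1.

    clusters_edges[c] lists the edge gadgets (pairs) found in cluster c.
    A single pass over the clusters suffices: once a cluster is processed all
    of its edge gadgets are marked, and marks are never removed.
    """
    marked_edges = set()
    type_b = set()
    for cluster in clusters_edges:
        unmarked = [e for e in cluster if e not in marked_edges]   # U(C)
        if not unmarked:
            continue
        cover = smallest_cover(unmarked)                           # V(C)
        e_prime = {e for e in all_edges                            # E'
                   if e not in marked_edges and (e[0] in cover or e[1] in cover)}
        type_b |= cover
        marked_edges |= e_prime
    type_a = set(all_vertices) - type_b
    return type_b, type_a
```

Since every edge gadget belongs to some cluster, every edge ends up marked, so the vertices returned in `type_b` always touch every edge of the input graph.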



Notice that at the end of the execution of the algorithm, each vertex gadget is assigned 
either a type a or a type b solution, and that each row is assigned to one of those solutions. As 
type a solutions are optimal, we can concentrate on type b solutions. 

Clusters corresponding to type b solutions are built iteratively at lines 3-9 of Alg. 1. More
precisely, at each iteration the algorithm examines the set of clusters of $S_1$ and
extracts a cluster C containing at least one unmarked edge gadget. Then the algorithm imposes
a type b solution on a set $V(C)$ of vertex gadgets and on a set $E'$ of edge gadgets so that
$U(C) \subseteq E'$. This step is the only one where the virtual cost of some rows can be modified.
More precisely, only rows of the core vertex gadgets in $V(C)$ and edge gadgets in $E'$ may have
in $S_2$ a virtual cost different from that in $S_1$.

Notice that, by Lemma 14, each edge gadget in $E' - U(C)$ has virtual cost at least 12 in
solution $S_1$, and virtual cost 12 in solution $S_2$. Hence the virtual cost of such an edge gadget
is not increased by Alg. 1.

Now, we consider the rows associated with core vertex gadgets in $V(C)$ and edge gadgets
in $U(C)$. For simplicity's sake, let us denote by $virt_S(V(C))$ the sum of the virtual costs of the
set of rows associated with core vertex gadgets in $V(C)$ in a solution S and, similarly, let us
denote by $virt_S(U(C))$ the sum of the virtual costs of the set of rows associated with unmarked
edge gadgets in $U(C)$.

Observe that, by construction, the sets $U(C)$ considered in different iterations of Alg. 1 are
pairwise disjoint, therefore it makes sense to analyze each iteration separately. Consequently,
the correctness of Alg. 1 can be proved by showing that,
in a generic iteration of Alg. 1, the following lemma holds:



Lemma 15. Let C be the cluster containing an unmarked edge gadget found at line 3 of Alg. 1.
Then $virt_{S_2}(V(C)) - virt_{S_1}(V(C)) \leq virt_{S_1}(U(C)) - virt_{S_2}(U(C))$.

Proof. The proof consists of several cases. For each case it will suffice to determine that one
of the following conditions holds:

(i) $virt_{S_1}(U(C)) - virt_{S_2}(U(C)) \geq 18|V(C)|$;

(ii) $virt_{S_1}(V(C)) + virt_{S_1}(U(C)) \geq 99|V(C)| + 12|U(C)|$;

(iii) $virt_{S_2}(V(C)) - virt_{S_1}(V(C)) \leq virt_{S_1}(U(C)) - virt_{S_2}(U(C))$.

Notice that conditions (i) and (ii) each imply condition (iii). First we show that condition (i)
implies condition (iii). Assume that condition (i) holds, that is
$virt_{S_1}(U(C)) - virt_{S_2}(U(C)) \geq 18|V(C)|$. Solution $S_2$ builds a type a or a type b solution
for $V(C)$ (whose cost is at most 99 for each vertex gadget), while we know that the optimal
solution for $V(C)$ has cost at least 81 for each vertex gadget (Lemma 7), which implies that
$virt_{S_2}(V(C)) - virt_{S_1}(V(C)) \leq 18|V(C)|$, and hence
$virt_{S_2}(V(C)) - virt_{S_1}(V(C)) \leq virt_{S_1}(U(C)) - virt_{S_2}(U(C))$.

Now we show that condition (ii) implies condition (iii). Assume that condition (ii) holds,
that is $virt_{S_1}(V(C)) + virt_{S_1}(U(C)) \geq 99|V(C)| + 12|U(C)|$. As, by construction of
solution $S_2$, $virt_{S_2}(V(C)) + virt_{S_2}(U(C)) = 99|V(C)| + 12|U(C)|$, it follows that
$virt_{S_2}(V(C)) - virt_{S_1}(V(C)) \leq virt_{S_1}(U(C)) - virt_{S_2}(U(C))$.

We will distinguish several cases, depending on the structure of $U(C)$. Recall that Remark
2 implies $|C| \leq 5$, hence $|U(C)| \leq 5$. Notice also that, by construction, $|V(C)| \leq |U(C)|$, and
that $virt_{S_2}(V(C)) - virt_{S_1}(V(C)) \leq 18|V(C)|$, as the set of rows of each vertex gadget in
$V(C)$ has a total cost of at least 81 in solution $S_1$ and at most 99 in solution $S_2$.

• Assume that $|U(C)| > 3$. By Lemma 11 the virtual cost (in $S_1$) of each edge gadget in
$U(C)$ is at least 36. Therefore $virt_{S_1}(V(C)) + virt_{S_1}(U(C)) \geq 81|V(C)| + 36|U(C)| =
81|V(C)| + 12|U(C)| + 24|U(C)| \geq 81|V(C)| + 12|U(C)| + 24|V(C)| = 12|U(C)| +
105|V(C)| \geq 12|U(C)| + 99|V(C)|$, as required by condition (ii).

• Assume that $|U(C)| = 3$ and no two gadgets in $U(C)$ are incident on a common vertex
gadget. By Lemma 10 the virtual cost (in $S_1$) of each edge gadget in $U(C)$ is at least
36. We can apply the same analysis as in the case $|U(C)| > 3$ to show that
$virt_{S_1}(V(C)) + virt_{S_1}(U(C)) \geq 12|U(C)| + 105|V(C)| \geq 12|U(C)| + 99|V(C)|$, as required
by condition (ii).

• Assume that $|U(C)| = 3$ and two gadgets in $U(C)$ are incident on a common vertex
gadget. By Lemma 10 the virtual cost (in $S_1$) of each edge gadget in $U(C)$ is at least
27, but notice that $|V(C)| \leq 2$. Therefore $virt_{S_1}(V(C)) + virt_{S_1}(U(C)) \geq 27 \cdot 3 + 81|V(C)|$.
It is immediate to notice that $27 \cdot 3 + 81|V(C)| \geq 12 \cdot 3 + 99|V(C)|$ when $|V(C)| \leq 2$, as
required by condition (ii).

• Assume that $|U(C)| = 2$ and that the two gadgets $EG_1, EG_2$ in $U(C)$ are not incident
on a common vertex gadget. By Lemma [HI the virtual cost in $S_1$ of each edge gadget
in $U(C)$ is at least 27, while by Prop. H] the virtual cost in $S_2$ of each edge gadget
in $U(C)$ is exactly 12. Hence $virt_{S_1}(U(C)) - virt_{S_2}(U(C)) \geq 30$. If $|V(C)| = 1$
then $virt_{S_2}(V(C)) - virt_{S_1}(V(C)) \leq 18$ and, a fortiori, $virt_{S_2}(V(C)) - virt_{S_1}(V(C)) \leq
virt_{S_1}(U(C)) - virt_{S_2}(U(C))$, as required by condition (iii). Therefore we are only
interested in the case $|V(C)| \geq 2$. Since $|V(C)| \leq |U(C)| = 2$, we can assume that $|V(C)| = 2$
and, consequently, $virt_{S_2}(V(C)) - virt_{S_1}(V(C)) \leq 36 = 18|V(C)|$ (this observation will
be used in the remaining part of this case).

Let r be a row in C which is not an edge gadget (one must exist because $|C| \geq 3$ and
there are exactly two edge gadgets in C). We have to distinguish two cases, according to
whether r is a jolly row or a row of a vertex gadget.

First consider the case when r is a jolly row. By Lemma 8 at least 27 entries of the vertex
blocks in the rows of C are deleted. Since r is a jolly row, $virt_{S_1}(EG_1), virt_{S_1}(EG_2) \geq
27|C|/(|C| - 1) \geq 33$, as $|C| \leq 5$. Therefore $virt_{S_1}(U(C)) - virt_{S_2}(U(C)) = virt_{S_1}(EG_1) +
virt_{S_1}(EG_2) - virt_{S_2}(EG_1) - virt_{S_2}(EG_2) \geq 33 \cdot 2 - 12 \cdot 2 = 42$, which implies that
$virt_{S_2}(V(C)) + virt_{S_2}(U(C)) \leq virt_{S_1}(V(C)) + virt_{S_1}(U(C))$, as required by condition
(iii).
Assume now that r is a row of a vertex gadget X. We have to consider two subcases,
depending on whether or not r is adjacent to $EG_1$ or $EG_2$. Assume that r is not
adjacent to EGi nor to EG2- By LemmaEl virts^ {U{G)),virts2 {U (C)) > 30. As we have 
assumed that $|V(C)| = 2$ and $virt_{S_2}(V(C)) - virt_{S_1}(V(C)) \leq 36$, we can immediately
prove that $virt_{S_2}(V(C)) + virt_{S_2}(U(C)) \leq virt_{S_1}(V(C)) + virt_{S_1}(U(C))$, as required
by condition (iii). The last subcase that we have to consider is when r is adjacent
to $EG_1$ or $EG_2$. Assume w.l.o.g. that r is adjacent to $EG_1$ and that $X \subseteq V(C)$.
Observe that r is co-clustered in $S_2$ in a type a or in a type b solution, hence by Prop.
\M $virt_{S_2}(r) \leq 12$, while $virt_{S_1}(r) \geq 27$, as $virt_{S_1}(EG_1) \geq 27$ by Lemma El. Taking
into account the fact that $virt_{S_1}(r) - virt_{S_2}(r) \geq 15$, as r is a row of $X \subseteq V(C)$, we
immediately obtain that $virt_{S_2}(V(C)) - virt_{S_1}(V(C)) \leq 18 \cdot 2 - 15 = 21$, which can be
coupled with $virt_{S_1}(U(C)) - virt_{S_2}(U(C)) \geq 30$ proved before to obtain $virt_{S_2}(V(C)) +
virt_{S_2}(U(C)) \leq virt_{S_1}(V(C)) + virt_{S_1}(U(C))$, as required by condition (iii).

• Assume that $|U(C)| = 2$ and that the edge gadgets $EG_1, EG_2$ in $U(C)$ are incident on a
common vertex gadget. Hence, by construction of the algorithm, $|V(C)| \leq 1$, therefore
$virt_{S_2}(V(C)) - virt_{S_1}(V(C)) \leq 18$. By Lemma El the virtual cost (in $S_1$) of each edge
gadget in $U(C)$ is at least 21, therefore $virt_{S_1}(U(C)) - virt_{S_2}(U(C)) \geq (21 - 12) \cdot 2 = 18$, as
required by condition (i).

• Assume that $U(C) = \{EG_{i,h}\}$ and $EG_{i,h}$ is clustered with two rows $r_1, r_2$ from
different vertex gadgets $VG_i$, $VG_j$. Since $|U(C)| = 1$, $|V(C)| \leq 1$, hence $virt_{S_2}(V(C)) -
virt_{S_1}(V(C)) \leq 18$. By Lemma 12, $virt_{S_1}(EG_{i,h}), virt_{S_1}(r_1), virt_{S_1}(r_2) \geq 18$, while by
Prop. ^ $virt_{S_2}(EG_{i,h}), virt_{S_2}(r_1), virt_{S_2}(r_2) \leq 12$, hence $virt_{S_1}(EG_{i,h}) - virt_{S_2}(EG_{i,h}) \geq
6$ while $virt_{S_2}(V(C)) - virt_{S_1}(V(C)) \leq 6$, which immediately implies $virt_{S_2}(V(C)) +
virt_{S_2}(U(C)) \leq virt_{S_1}(V(C)) + virt_{S_1}(U(C))$, as required by condition (iii).

• Assume that $U(C) = \{EG_{i,h}\}$ and that C contains the edge gadget $EG_{i,h}$ and (at least)
two rows $r_1, r_2$ of $VG_j$, with $j \neq i, h$. Since $|U(C)| = 1$, $|V(C)| \leq 1$, hence
$virt_{S_2}(V(C)) - virt_{S_1}(V(C)) \leq 18$.

Initially we will prove that for each row r of C the virtual cost $virt_{S_1}(r) \geq 21$. In
fact $r_1$ and $r_2$ may share a gadget encoding operation and a vertex encoding operation
(if they are incident on a common vertex), therefore there are three distinct gadget
encoding operations and four distinct vertex encoding operations overall, and none
of these operations is shared by all rows in C. An immediate consequence is that at
least 21 entries of each row of C must be suppressed, therefore $virt_{S_1}(\{r_1, r_2, EG_{i,h}\}) -
virt_{S_2}(\{r_1, r_2, EG_{i,h}\}) \geq (21 - 12) \cdot 3 = 27$, as by Prop. 3 each row of C in $S_2$ has
virtual cost at most 12. For bookkeeping purposes we attribute the entire value of
$virt_{S_1}(\{r_1, r_2, EG_{i,h}\}) - virt_{S_2}(\{r_1, r_2, EG_{i,h}\})$ to $EG_{i,h}$ (this bookkeeping trick is
possible as a row r is allowed to give its "credit" only to an edge gadget with which it
is co-clustered in $S_1$), therefore we obtain $virt_{S_1}(U(C)) - virt_{S_2}(U(C)) \geq 27 \geq 18$, as
required by condition (i).

• Assume that $U(C) = \{EG_{i,j}\}$ and that C contains the edge gadget $EG_{i,j}$ and (at
least) two rows $r_1, r_2$ of $VG_i$, with $r_2$ not adjacent to $EG_{i,j}$. Notice that, by case
8 of Prop. 3, $H(r_2, EG_{i,j}) \geq 15$. Therefore $virt_{S_1}(\{EG_{i,j}, r_1, r_2\}) \geq 45$, while
$virt_{S_2}(\{EG_{i,j}, r_1, r_2\}) \leq 36$ by Prop. H.

In what follows we will consider the virtual costs of the rows in $VG_i$. More precisely,
let T be the set $VG_i - \{r_1, r_2\}$; we will show that there exists a row $r \in T$ such that
$virt_{S_1}(r) \geq 12$.

Assume initially that there exists a row $r \in T$ that is clustered with an edge gadget or
with a row belonging to a different vertex gadget $VG_j$. Then $virt_{S_1}(r) \geq 12$ by LemmafMl
and case 1 of Prop. 3. Hence, we can assume that the rows in T are clustered only with
rows of $VG_i$, which implies that $r_1$ and $r_2$ are not clustered together with any row in T,
as $r_1$ and $r_2$ are clustered with $EG_{i,j}$.

Since T contains 7 rows, a trivial counting argument shows that 4 rows of T are clustered
together, or there is a row $r \in T$ that is clustered with a jolly row of $VG_i$. Indeed, if
four rows of T are clustered together, by construction there are two of those four rows
that are not incident on a common vertex, therefore an immediate application of case 4
of Prop. 3 gives the desired result. If r is clustered with a jolly row then, by Lemma EJ,
$virt_{S_1}(r) \geq 12$.

Now we know that there is a row r in $VG_i - \{r_1, r_2\}$ such that $virt_{S_1}(r) \geq 12$. Moreover
$virt_{S_1}(\{EG_{i,j}, r_1, r_2\}) \geq 45$ and $virt_{S_2}(\{EG_{i,j}, r_1, r_2\}) \leq 36$.

By Lemma 7 the virtual cost of any row of $VG_i$ different from r, $r_1$, $r_2$ is at least 9. Since
$virt_{S_1}(r_1) = virt_{S_1}(r_2) \geq 15$, $virt_{S_1}(VG_i) \geq 6 \cdot 9 + 12 + 15 \cdot 2 = 96$, while
$virt_{S_2}(VG_i) = 99$ (since $S_2$ has a type b solution for the rows in $VG_i$).

Since $virt_{S_1}(EG_{i,j}) \geq 15$ and $virt_{S_2}(EG_{i,j}) = 12$ by Prop. [H, it is immediate to obtain
that $virt_{S_1}(V(C)) + virt_{S_1}(U(C)) \geq virt_{S_2}(V(C)) + virt_{S_2}(U(C))$, as required by condition
(iii).

• Assume that $U(C) = \{EG_{i,h}\}$ and that C contains at least a jolly row $e_j$ of a vertex
gadget $VG_i$ or $VG_h$ (w.l.o.g. $VG_i$). We assume that $VG_i \subseteq V(C)$. Furthermore,
we can assume that no other row of a different vertex gadget belongs to C, otherwise
the previous cases hold. By case 2 of Prop. 3 the Hamming distance of $EG_{i,h}$ and
$e_j$ is at least 14, therefore $virt_{S_1}(C) \geq 14|C|$, while $virt_{S_2}(C) \leq 12(|C| - 1)$, hence
$virt_{S_1}(C) - virt_{S_2}(C) \geq 2(|C| - 1) + 14$ and, since $|C| \geq 3$, $virt_{S_1}(C) - virt_{S_2}(C) \geq 18$.
Once again, we attribute the entire value of $virt_{S_1}(C) - virt_{S_2}(C) \geq 18$ to $EG_{i,h}$ (this
bookkeeping trick is possible as a row r is allowed to give its "credit" only to an edge
gadget with which it is co-clustered in $S_1$), therefore $virt_{S_1}(U(C)) - virt_{S_2}(U(C)) \geq 18$,
as required by condition (i).

The proof is completed by the observation that the only possible case that is not explicitly
considered in the above cases is when an unmarked edge gadget is clustered in $S_1$ only with
rows of the core vertex gadget $VG_i$ and all rows share a common vertex. In such a case, Alg. 1
does not modify the clustering, as $V(C)$ consists only of the vertex gadget $VG_i$ and $V(C)$ has
a type b solution in $S_2$. □
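
The cases of the proof that rely on condition (ii) reduce to simple arithmetic on the lower bounds 81 (per vertex gadget) and 27 or 36 (per edge gadget). The following Python sketch, which is only an illustrative check and not part of the original argument, enumerates the admissible values of $|V(C)|$ and $|U(C)|$; the helper name `condition_ii_holds` is hypothetical. Checking the inequality on the lower bounds is enough, because larger actual costs only increase the left-hand side.

```python
def condition_ii_holds(num_vertex_gadgets, num_edge_gadgets, edge_lower_bound,
                       vertex_lower_bound=81):
    """Check vertex_lb*|V(C)| + edge_lb*|U(C)| >= 99*|V(C)| + 12*|U(C)|."""
    lhs = vertex_lower_bound * num_vertex_gadgets + edge_lower_bound * num_edge_gadgets
    rhs = 99 * num_vertex_gadgets + 12 * num_edge_gadgets
    return lhs >= rhs


# Case |U(C)| > 3: every edge gadget costs at least 36 and |V(C)| <= |U(C)| <= 5.
case_more_than_three = all(
    condition_ii_holds(v, u, 36) for u in (4, 5) for v in range(1, u + 1)
)

# Case |U(C)| = 3 with no common vertex gadget: bound 36 and |V(C)| <= 3.
case_three_disjoint = all(condition_ii_holds(v, 3, 36) for v in (1, 2, 3))

# Case |U(C)| = 3 with a common vertex gadget: bound 27 but |V(C)| <= 2.
case_three_shared = all(condition_ii_holds(v, 3, 27) for v in (1, 2))

# The weaker bound 27 would not be enough if |V(C)| could be 3.
bound_27_fails_for_three = not condition_ii_holds(3, 3, 27)
```

The last line shows why the case $|U(C)| = 3$ has to be split: the bound 27 suffices only because a shared vertex gadget forces $|V(C)| \leq 2$.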

Lemma 16. Let S be a solution of an instance of 3-ABT associated with an instance of MVCC.
Then we can compute in polynomial time a canonical solution $S_c$ such that $c(S_c) \leq c(S)$.

Theorem 17. Let $\mathcal{G} = (V, E)$ be an instance of MVCC. Then $\mathcal{G}$ has a cover of size p if and
only if the corresponding instance R of 3-ABT has a (canonical) solution S of cost
$99p + 81(n - p) + 12m$.

Proof. Let us show that if $\mathcal{G}$ has a vertex cover $V_c$ of size p, then R has a solution S of cost
$99p + 81(n - p) + 12m$. Since $V_c$ is a vertex cover, it is possible to construct a canonical
solution S for R consisting of a type b solution for all vertex gadgets associated with vertices in
$V_c$ and a type a solution for all other vertex gadgets. Indeed, each edge gadget can be clustered in
a type b solution of a vertex gadget on which the edge is incident, choosing arbitrarily whenever
there is more than one possibility. Finally, for each docking vertex, its jolly rows that are not
used in some type b solution are clustered together. The cost follows immediately from the previous
observations.

Let us consider now a solution S of 3-ABT over instance R with cost $99p + 81(n - p) + 12m$.
By Lemma 16 we can assume that S is a canonical solution, therefore R has a set $V_c$ of p vertex
gadgets that are associated with a type b solution. By construction, each edge gadget must be
in a type b solution, for otherwise S is not a canonical solution. Hence the set of vertices of $\mathcal{G}$
associated with vertex gadgets in $V_c$ is a vertex cover of $\mathcal{G}$ of size p. □

Since the cost of a canonical solution of 3-ABT and the size of a vertex cover of the graph
$\mathcal{G}$ are linearly related, the reduction is an L-reduction, thus completing the proof of
APX-hardness.
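
The linear relation used above can be made explicit with a short Python sketch. It only restates the cost formula of Theorem 17; the function names `canonical_cost_3abt` and `cover_size_from_cost` are hypothetical and introduced for illustration. Rewriting the cost as $81n + 12m + 18p$ shows that two canonical solutions differ in cost by exactly 18 times the difference of the corresponding cover sizes.

```python
def canonical_cost_3abt(n, m, p):
    """Cost of a canonical solution: p type b gadgets (99 each), n - p type a
    gadgets (81 each) and m edge gadgets (12 each), as in Theorem 17."""
    return 99 * p + 81 * (n - p) + 12 * m


def cover_size_from_cost(n, m, cost):
    """Invert the formula: cost = 81*n + 12*m + 18*p."""
    extra = cost - 81 * n - 12 * m
    if extra < 0 or extra % 18 != 0:
        raise ValueError("cost does not correspond to a canonical solution")
    return extra // 18


# Example on the complete graph K4, a cubic graph with n = 4 and m = 6,
# whose minimum vertex cover has size 3.
k4_cost = canonical_cost_3abt(4, 6, 3)
```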

4 APX-hardness of 4-AP(8) 

In this section we prove that the 4-anonymity problem is APX-hard even if all rows of the 
input table have 8 entries (this restriction is denoted by 4-AP(8)). More precisely, we give an 
L-reduction from Minimum Vertex Cover on Cubic Graphs (MVCC) to 4-AP(8). 

Given a cubic graph $\mathcal{G} = (V, E)$, with $V = \{v_1, \ldots, v_n\}$ and $E = \{e_1, \ldots, e_m\}$, we will
construct an instance R of 4-AP(8) consisting of a set $R_i$ of 5 rows for each vertex $v_i \in V$, an
edge row $r(i, j)$ for each edge $e = (v_i, v_j) \in E$ and a set F of 4 rows. The 8 columns are divided
into 4 blocks of two columns each. For each vertex $v_i$, all the rows in $R_i$ are associated with a
block called edge block, denoted by $b(R_i)$, so that $b(R_i) \neq b(R_j)$ for each $v_j$ adjacent to $v_i$ in
$\mathcal{G}$. The latter property can be easily enforced in polynomial time as the graph is cubic.

The entries of the rows in $R_i = \{r_{i,1}, \ldots, r_{i,5}\}$ are over the alphabet
$\Sigma(R_i) = \{a_{i,1}, \ldots, a_{i,5}, a_i\}$. The entries of the columns corresponding to the edge block
$b(R_i)$, as well as those of the odd columns, are set to $a_i$ for all the rows in $R_i$. The entries of
the even columns not in $b(R_i)$ of each row $r_{i,h}$ are set to $a_{i,h}$.

For each edge $e = (v_i, v_j)$, we define a row $r(i, j)$ (called edge row) of R. Row $r(i, j)$ has
value $a_i$ (equal to the values of the rows in $R_i$) in the two columns corresponding to the edge
block $b(R_i)$, value $a_j$ (equal to the values of the rows in $R_j$) in the two columns corresponding
to the edge block $b(R_j)$, and value $t_{i,j}$ in all other columns. Given a set of rows $R_i$, we denote
by $E(R_i)$ the set of rows $r(i, j)$, $r(i, l)$, $r(i, h)$ associated with edges of $\mathcal{G}$ incident on $v_i$.
Finally, we introduce in the instance R of 4-AP(8) a set of 4 rows $F = \{f_1, f_2, f_3, f_4\}$, over
alphabet $\Sigma(F) = \{u_1, \ldots, u_4\}$. Each row $f_i$ is called a free row and all its entries have
value $u_i$.
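
The construction can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: vertices are numbered from 0, columns are 0-based (so the "even columns" of the text are the second column of each block), symbols are encoded as tuples, the edge blocks are chosen by a greedy assignment, and the names `build_4ap8_instance` and `cluster_cost` are hypothetical. The cost of a cluster is computed as the number of columns on which its rows disagree, multiplied by the number of rows.

```python
def build_4ap8_instance(n, edges):
    """Build the rows for a cubic graph with vertices 0..n-1.

    Returns (vertex_rows, edge_rows, free_rows, block): vertex_rows[i] holds
    the 5 rows of R_i and edge_rows[(i, j)] is the edge row r(i, j).
    """
    adj = {v: set() for v in range(n)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)

    # Greedy choice of one of the 4 blocks per vertex so that adjacent vertices
    # get different blocks; with degree 3 a free block always exists.
    block = {}
    for v in range(n):
        used = {block[u] for u in adj[v] if u in block}
        block[v] = min(b for b in range(4) if b not in used)

    def block_cols(b):
        return (2 * b, 2 * b + 1)  # 0-based columns of block b

    vertex_rows = {}
    for i in range(n):
        rows = []
        for h in range(1, 6):
            row = [("a", i)] * 8
            for col in range(8):
                # second column of every block other than the edge block
                if col % 2 == 1 and col not in block_cols(block[i]):
                    row[col] = ("a", i, h)
            rows.append(tuple(row))
        vertex_rows[i] = rows

    edge_rows = {}
    for i, j in edges:
        row = [("t", i, j)] * 8
        for col in block_cols(block[i]):
            row[col] = ("a", i)
        for col in block_cols(block[j]):
            row[col] = ("a", j)
        edge_rows[(i, j)] = tuple(row)

    free_rows = [tuple([("u", q)] * 8) for q in range(1, 5)]
    return vertex_rows, edge_rows, free_rows, block


def cluster_cost(rows):
    """Every column on which the rows disagree is suppressed in all rows."""
    disagreeing = sum(1 for column in zip(*rows) if len(set(column)) > 1)
    return disagreeing * len(rows)
```

For example, on the complete graph on 4 vertices (which is cubic), clustering the five rows of one set $R_i$ suppresses 3 entries per row, while clustering one row of $R_i$ with its three incident edge rows suppresses 6 entries per row.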

Since all rows have 8 entries, w.l.o.g. we can assume that there exists only one cluster
$F_c$, called the filler cluster, whose cost is equal to $8|F_c|$. In fact, if there exist two clusters
$F_c$, $F'_c$ whose costs are equal to $8|F_c|$ and $8|F'_c|$ respectively, then we can merge them without
increasing the cost of the solution. The free rows must belong to $F_c$, as each free row has
Hamming distance 8 from all other rows of R. Notice that, by construction,
$\Sigma(R_i) \cap \Sigma(R_j) = \emptyset$ for $i \neq j$,
hence two rows have Hamming distance smaller than 8 only if they both belong to $R_i \cup E(R_i)$
for some i. This observation immediately implies the following proposition.

Proposition 18. Let S be a solution of an instance of 4-AP(8) associated with an instance
of MVCC and let C be a cluster of S where each row in C has cost strictly less than 8. Then
$C \subseteq R_i \cup E(R_i)$ for some i.

Since in $R_i \cup E(R_i)$ there are 8 rows, there can be at most two clusters having rows in
$R_i \cup E(R_i)$ and satisfying the statement of Proposition 18. Consider a solution S and a set of
rows $R_i$. We will say that S is a black solution for $R_i$ if in S there is a cluster containing 4
rows of $R_i$ and a cluster containing one row of $R_i$ and the three rows of $E(R_i)$. We will say that
S is a red solution for $R_i$ if in S there is a cluster consisting of all 5 elements of $R_i$. By an
abuse of language we will say respectively that $R_i$ is black (resp. red) in S. Given an instance R
of 4-AP(8), a solution where each set $R_i$ is either black or red is called a canonical solution.
Notice that a canonical solution consists of a filler cluster and a red or black solution for each
$R_i$. The main technical step in our reduction consists of proving Lemma 22, which states that,
starting from a solution S, it is possible to compute in polynomial time a canonical solution
$S'$ with cost not larger than that of S. To achieve this goal we need some technical results.

Next we show that moving the rows of Ri that are in the filler cluster to another existing 
cluster that contains some rows of Ri (if possible) or to a new cluster, does not increase the 
cost of the solution. 

Lemma 19. Let S be a solution of an instance of 4-AP(8) associated with an instance of
MVCC, and let r be a row of $R_i$. Then at least three even entries of r that are not in the edge
block are deleted.

Proof. The lemma follows from the property that a row $r \in R_i$ must be co-clustered with at
least three other rows, and that r disagrees with any other row of R in the even entries not in
the edge block of $R_i$. □



Lemma 20. Let S be a solution of an instance of 4-AP(8) associated with an instance of 
MVCC. Then we can compute in polynomial time a solution S' with cost not larger than that 
of S and such that in S' there exist at most two sets containing some rows of Ri. 

Proof. Consider a generic set of rows $R_i$; clearly, if at most two sets of S contain some rows
of $R_i$, then S satisfies the lemma, therefore assume that in S there are at least three sets
containing some rows of $R_i$. Let $C^1, \ldots, C^t$ be the clusters of S containing rows of $R_i$. By a
simple counting argument, at most one of the clusters (w.l.o.g. let $C^1$ be such a cluster) can
be a subset of $R_i \cup E(R_i)$; all other clusters $C^j$ contain some rows not in $R_i \cup E(R_i)$ (as well
as some rows in $R_i \cup E(R_i)$ by construction), therefore the cost of each row of $C^2, \ldots, C^t$ is
8. Move all the rows of $C^2, \ldots, C^t$ to the filler cluster to obtain a solution whose cost is not
larger than that of the original solution. At the same time the resulting solution has exactly
two clusters containing some rows of $R_i$. Repeating the process for all sets $R_i$ completes the
proof. □

Hence, in what follows we assume that in any solution there are at most two sets containing 
rows of each set Ri. 

Lemma 21. Let S be a solution of an instance of 4-AP(8) associated with an instance of
MVCC. Then it is possible to compute in polynomial time a solution $S'$, whose cost is not
larger than that of S, such that the filler cluster $F_c$ of $S'$ consists of all free rows and some
(possibly zero) edge rows. Moreover in $S'$ there are at most two clusters containing rows of $R_i$.

Proof. Consider a generic set $R_i$. By Lemma 20 we already know that there are at most two
clusters of S containing some rows of $R_i$. Assume initially that there exists only one cluster of
S containing some rows of $R_i$. If such a cluster is the filler cluster, then remove all rows of $R_i$
from the filler cluster and make $R_i$ a cluster of $S'$. In the resulting solution none of the rows
of $R_i$ are in the filler cluster.

Consider now the case that there are exactly two clusters $C_1$, $C_2$ of S containing some rows
of $R_i$. If one of those clusters (say $C_2$) is the filler cluster, then move all rows of $R_i$ that
are in $F_c$ to $C_1$, obtaining two clusters $C_1 \cup R_i$ and $F_c - R_i$. Notice that before moving the
rows of $C_2$ to $C_1$, at least one even position not in $b(R_i)$ is deleted for each row in $C_1$. As each
row moved from $C_2$ to $C_1$ differs from any other row of $C_1$ in at most two not yet deleted
entries, and all the entries of block $b(R_i)$ are equal for the rows in $R_i \cup E(R_i)$, it follows that
this change does not increase the cost of the solution. □

Now we are ready to prove Lemma 22.

Lemma 22. Let S be a solution of an instance of 4-AP(8) associated with an instance of
MVCC. Then it is possible to compute in polynomial time a canonical solution $S'$ with cost not
larger than that of S.

Proof. Consider a generic set of rows $R_i$. By Lemma 21 no row of $R_i$ belongs to the filler
cluster. Therefore, if $R_i$ is neither red nor black in S, then the rows of $R_i$ can be partitioned
in S in one of the following two ways: (i) a cluster $C_1$ contains three rows of $R_i$ and a row
of $E(R_i)$, while $C_2$ contains two rows of $R_i$ and two rows of $E(R_i)$, or (ii) a cluster C of S
contains all rows of $R_i$ and some rows of $E(R_i)$.

In the first case, replace $C_1$ and $C_2$ with two clusters $C'_1$, $C'_2$, where $C'_1$ consists of 4 rows
of $R_i$ and $C'_2$ consists of a row of $R_i$ and all rows of $E(R_i)$ (it is immediate to notice that $C'_1$,
$C'_2$ have cost 12 and 24 respectively, while $C_1$ and $C_2$ both have cost 24).

In the second case let X be the set $C \cap E(R_i)$, replace C with cluster $C' = R_i$ and move
all rows in X to the filler cluster. Let $x = |X|$; then the cost of C in S is $6(5 + x)$, while the
cost of $C'$ and X in the new solution is equal to $3 \cdot 5 + 8 \cdot x$. Since $x \leq 3$, the cost of the new
solution is strictly smaller than that of S. □
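
As a worked check of the two cases, using only the costs stated in the proof, the savings obtained by the replacement can be written out as follows (in the first case the total cost drops from $24 + 24$ to $12 + 24$; in the second case it uses $0 \leq x \leq 3$):

```latex
\[
  (24 + 24) - (12 + 24) = 12 > 0,
  \qquad
  6(5 + x) - (3 \cdot 5 + 8x) = 15 - 2x \geq 15 - 6 = 9 > 0 .
\]
```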

Notice that, given a canonical solution S, each red set $R_i$ in S has a cost of 15, each black
set $R_i$ in S has a cost of 36 (that is, a cost of 12 associated with the rows of $R_i$ and a cost of
24 associated with the 3 edge rows in the black solution of $R_i$), and the filler cluster $F_c$ has cost
$8|F_c|$. Now, it is easy to see that Lemma 23 holds.

Lemma 23. Let S be a canonical solution with k red sets $R_i$ of an instance of 4-AP(8)
associated with an instance of MVCC. Then S has cost $12(|V| - k) + 15k + 8|E| + 32$.

Now, we can show that the sets of rows $R_i$ that are red in a canonical solution S correspond
to a vertex cover of the graph $\mathcal{G}$.

Lemma 24. Let S be a canonical solution of cost $12(|V| - k) + 15k + 8|E| + 32$ of an instance
of 4-AP(8) associated with an instance of MVCC. Then a vertex cover of $\mathcal{G}$ of size k can be
computed in polynomial time.

Proof. Since S is a canonical solution of 4-AP(8) of cost $12(|V| - k) + 15k + 8|E| + 32$, all the
sets $R_i$ must be associated with either a red or a black solution. Furthermore, since all the
edge rows have a cost of 8 in S, there must exist k sets $R_i$ associated with a red solution,
and $|V| - k$ sets associated with a black solution.

Notice that, given two black sets $R_i$ and $R_j$, there cannot be an edge between the two vertices
$v_i$ and $v_j$ of $\mathcal{G}$ associated with $R_i$ and $R_j$, by definition of black solution. Hence, the set of
vertices associated with black sets of S is an independent set of $\mathcal{G}$, which in turn implies that
the vertices associated with red sets are a vertex cover of $\mathcal{G}$. □

Theorem 25. The 4-AP(8) problem is APX-hard.

Proof. Let C be a vertex cover of graph $\mathcal{G}$. Then, it is easy to see that a canonical solution S of
the instance of 4-AP(8) associated with $\mathcal{G}$ such that S has cost at most $12|V| + 3|C| + 8|E| + 32$
can be computed in polynomial time by defining a black solution for each set $R_i$ associated
with a vertex $v_i \in V \setminus C$, a red solution for each set $R_i$ associated with a vertex $v_i \in C$, and
assigning all the remaining rows to the filler cluster $F_c$.

On the other hand, by Lemma 24, starting from a canonical solution of 4-AP(8) with cost
$12(|V| - k) + 15k + 8|E| + 32$, we can compute in polynomial time a cover of size k for $\mathcal{G}$. Since
the cost of a canonical solution of 4-AP(8) and the size of a vertex cover of the graph $\mathcal{G}$ are
linearly related, the reduction is an L-reduction, thus completing the proof. □
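
The correspondence between red sets and vertex covers can be sketched in Python as follows. This is an illustrative sketch only: the function names `canonical_cost_4ap8` and `red_sets_form_cover` are hypothetical, and the code merely restates the cost formula of Lemma 23 and the observation, from the proof of Lemma 24, that two black sets cannot be adjacent.

```python
def canonical_cost_4ap8(num_vertices, num_edges, num_red):
    """Cost of a canonical solution with num_red red sets (Lemma 23)."""
    return 12 * (num_vertices - num_red) + 15 * num_red + 8 * num_edges + 32


def red_sets_form_cover(edges, red):
    """Black sets must be pairwise non-adjacent, that is, every edge needs at
    least one endpoint whose set is red."""
    return all(i in red or j in red for i, j in edges)


# Example on the complete graph K4 (n = 4, m = 6) with the cover {0, 1, 2}.
k4_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
k4_cost = canonical_cost_4ap8(4, len(k4_edges), 3)
```

Each additional red set increases the cost by exactly 3, which is the linear relation between solution cost and cover size used in the proof.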






5 Acknowledgements 



PB and GDV have been partially supported by FAR 2008 grant "Computational models for
phylogenetic analysis of gene variations". PB has been partially supported by the MIUR PRIN
grant "Mathematical aspects and emerging applications of automata and formal languages".



References 

[1] G. Aggarwal, T. Feder, K. Kenthapadi, S. Khuller, R. Panigrahy, D. Thomas, and A. Zhu.
Achieving anonymity via clustering. In S. Vansummeren, editor, PODS, pages 153-162.
ACM, 2006.

[2] G. Aggarwal, T. Feder, K. Kenthapadi, R. Motwani, R. Panigrahy, D. Thomas, and A. Zhu.
Anonymizing tables. In T. Eiter and L. Libkin, editors, ICDT, volume 3363 of Lecture
Notes in Computer Science, pages 246-258. Springer, 2005.

[3] G. Aggarwal, K. Kenthapadi, R. Motwani, R. Panigrahy, D. Thomas, and A. Zhu.
Approximation algorithms for k-anonymity. J. Privacy Technology, 2.

[4] P. Alimonti and V. Kann. Some APX-completeness results for cubic graphs. Theoretical
Computer Science, 237(1-2):123-134, 2000.

[5] G. Ausiello, P. Crescenzi, G. Gambosi, V. Kann, A. Marchetti-Spaccamela, and M. Protasi.
Complexity and Approximation: Combinatorial optimization problems and their
approximability properties. Springer-Verlag, 1999.

[6] A. Gionis and T. Tassa. k-anonymization with minimal loss of information. In L. Arge,
M. Hoffmann, and E. Welzl, editors, ESA, volume 4698 of Lecture Notes in Computer
Science, pages 439-450. Springer, 2007.

[7] H. Park and K. Shim. Approximate algorithms for k-anonymity. In C. Y. Chan, B. C.
Ooi, and A. Zhou, editors, SIGMOD Conference, pages 67-78. ACM, 2007.

[8] P. Samarati. Protecting respondents' identities in microdata release. IEEE Trans. Knowl.
Data Eng., 13(6):1010-1027, 2001.

[9] P. Samarati and L. Sweeney. Generalizing data to provide anonymity when disclosing
information (abstract). In PODS, page 188. ACM Press, 1998.

[10] L. Sweeney. k-anonymity: a model for protecting privacy. International Journal on
Uncertainty, Fuzziness and Knowledge-based Systems, 10(5):557-570, 2002.






